stereo image
Efficient Depth Estimation for Unstable Stereo Camera Systems on AR Glasses
Stereo depth estimation is a fundamental component in augmented reality (AR) applications. Although AR applications require very low latency for their real-time applications, traditional depth estimation models often rely on time-consuming preprocessing steps such as rectification to achieve high accuracy. Also, non standard ML operator based algorithms such as cost volume also require significant latency, which is aggravated on compute resource-constrained mobile platforms. Therefore, we develop hardware-friendly alternatives to the costly cost volume and preprocessing and design two new models based on them, MultiHeadDepth and HomoDepth. Our approaches for cost volume is replacing it with a new group-pointwise convolution-based operator and approximation of consine similarity based on layernorm and dot product. For online stereo rectification (preprocessing), we introduce homograhy matrix prediction network with a rectification positional encoding (RPE), which delivers both low latency and robustness to unrectified images, which eliminates the needs for preprocessing. Our MultiHeadDepth, which includes optimized cost volume, provides 11.8-30.3% improvements in accuracy and 22.9-25.2% reduction in latency compared to a state-of-the-art depth estimation model for AR glasses from industry. Our HomoDepth, which includes optimized preprocessing (Homograhpy + RPE) upon MultiHeadDepth, can process unrectified images and reduce the end-to-end latency by 44.5%. We adopt a multi-task learning framework to handle misaligned stereo inputs on HomoDepth, which reduces theAbsRel error by 10.0-24.3%. The results demonstrate the efficacy of our approaches in achieving both high model performance with low latency, which makes a step forward toward practical depth estimation on future AR devices.
SDI-Net: Toward Sufficient Dual-View Interaction for Low-light Stereo Image Enhancement
Hu, Linlin, Sun, Ao, Hao, Shijie, Hong, Richang, Wang, Meng
Currently, most low-light image enhancement methods only consider information from a single view, neglecting the correlation between cross-view information. Therefore, the enhancement results produced by these methods are often unsatisfactory. In this context, there have been efforts to develop methods specifically for low-light stereo image enhancement. These methods take into account the cross-view disparities and enable interaction between the left and right views, leading to improved performance. However, these methods still do not fully exploit the interaction between left and right view information. To address this issue, we propose a model called Toward Sufficient Dual-View Interaction for Low-light Stereo Image Enhancement (SDI-Net). The backbone structure of SDI-Net is two encoder-decoder pairs, which are used to learn the mapping function from low-light images to normal-light images. Among the encoders and the decoders, we design a module named Cross-View Sufficient Interaction Module (CSIM), aiming to fully exploit the correlations between the binocular views via the attention mechanism. The quantitative and visual results on public datasets validate the superiority of our method over other related methods. Ablation studies also demonstrate the effectiveness of the key elements in our model.
ODTFormer: Efficient Obstacle Detection and Tracking with Stereo Cameras Based on Transformer
Ding, Tianye, Li, Hongyu, Jiang, Huaizu
Obstacle detection and tracking represent a critical component in robot autonomous navigation. In this paper, we propose ODTFormer, a Transformer-based model to address both obstacle detection and tracking problems. For the detection task, our approach leverages deformable attention to construct a 3D cost volume, which is decoded progressively in the form of voxel occupancy grids. We further track the obstacles by matching the voxels between consecutive frames. The entire model can be optimized in an end-to-end manner. Through extensive experiments on DrivingStereo and KITTI benchmarks, our model achieves state-of-the-art performance in the obstacle detection task. We also report comparable accuracy to state-of-the-art obstacle tracking models while requiring only a fraction of their computation cost, typically ten-fold to twenty-fold less. The code and model weights will be publicly released.
Landmark Stereo Dataset for Landmark Recognition and Moving Node Localization in a Non-GPS Battlefield Environment
Sapkota, Ganesh, Madria, Sanjay
In this paper, we have proposed a new strategy of using the landmark anchor node instead of a radio-based anchor node to obtain the virtual coordinates (landmarkID, DISTANCE) of moving troops or defense forces that will help in tracking and maneuvering the troops along a safe path within a GPS-denied battlefield environment. The proposed strategy implements landmark recognition using the Yolov5 model and landmark distance estimation using an efficient Stereo Matching Algorithm. We consider that a moving node carrying a low-power mobile device facilitated with a calibrated stereo vision camera that captures stereo images of a scene containing landmarks within the battlefield region whose locations are stored in an offline server residing within the device itself. We created a custom landmark image dataset called MSTLandmarkv1 with 34 landmark classes and another landmark stereo dataset of those 34 landmark instances called MSTLandmarkStereov1. We trained the YOLOv5 model with MSTLandmarkv1 dataset and achieved 0.95 mAP @ 0.5 IoU and 0.767 mAP @ [0.5: 0.95] IoU. We calculated the distance from a node to the landmark utilizing the bounding box coordinates and the depth map generated by the improved SGM algorithm using MSTLandmarkStereov1. The tuple of landmark IDs obtained from the detection result and the distances calculated by the SGM algorithm are stored as the virtual coordinates of a node. In future work, we will use these virtual coordinates to obtain the location of a node using an efficient trilateration algorithm and optimize the node position using the appropriate optimization method.
3D Skeletonization of Complex Grapevines for Robotic Pruning
Schneider, Eric, Jayanth, Sushanth, Silwal, Abhisesh, Kantor, George
Robotic pruning of dormant grapevines is an area of active research in order to promote vine balance and grape quality, but so far robotic efforts have largely focused on planar, simplified vines not representative of commercial vineyards. This paper aims to advance the robotic perception capabilities necessary for pruning in denser and more complex vine structures by extending plant skeletonization techniques. The proposed pipeline generates skeletal grapevine models that have lower reprojection error and higher connectivity than baseline algorithms. We also show how 3D and skeletal information enables prediction accuracy of pruning weight for dense vines surpassing prior work, where pruning weight is an important vine metric influencing pruning site selection.
StereoPose: Category-Level 6D Transparent Object Pose Estimation from Stereo Images via Back-View NOCS
Chen, Kai, James, Stephen, Sui, Congying, Liu, Yun-Hui, Abbeel, Pieter, Dou, Qi
Most existing methods for category-level pose estimation rely on object point clouds. However, when considering transparent objects, depth cameras are usually not able to capture meaningful data, resulting in point clouds with severe artifacts. Without a high-quality point cloud, existing methods are not applicable to challenging transparent objects. To tackle this problem, we present StereoPose, a novel stereo image framework for category-level object pose estimation, ideally suited for transparent objects. For a robust estimation from pure stereo images, we develop a pipeline that decouples category-level pose estimation into object size estimation, initial pose estimation, and pose refinement. StereoPose then estimates object pose based on representation in the normalized object coordinate space~(NOCS). To address the issue of image content aliasing, we further define a back-view NOCS map for the transparent object. The back-view NOCS aims to reduce the network learning ambiguity caused by content aliasing, and leverage informative cues on the back of the transparent object for more accurate pose estimation. To further improve the performance of the stereo framework, StereoPose is equipped with a parallax attention module for stereo feature fusion and an epipolar loss for improving the stereo-view consistency of network predictions. Extensive experiments on the public TOD dataset demonstrate the superiority of the proposed StereoPose framework for category-level 6D transparent object pose estimation.
End-to-end deep multi-score model for No-reference stereoscopic image quality assessment
Messai, Oussama, Chetouani, Aladine
Deep learning-based quality metrics have recently given significant improvement in Image Quality Assessment (IQA). In the field of stereoscopic vision, information is evenly distributed with slight disparity to the left and right eyes. However, due to asymmetric distortion, the objective quality ratings for the left and right images would differ, necessitating the learning of unique quality indicators for each view. Unlike existing stereoscopic IQA measures which focus mainly on estimating a global human score, we suggest incorporating left, right, and stereoscopic objective scores to extract the corresponding properties of each view, and so forth estimating stereoscopic image quality without reference. Therefore, we use a deep multi-score Convolutional Neural Network (CNN). Our model has been trained to perform four tasks: First, predict the left view's quality. Second, predict the quality of the left view. Third and fourth, predict the quality of the stereo view and global quality, respectively, with the global score serving as the ultimate quality. Experiments are conducted on Waterloo IVC 3D Phase 1 and Phase 2 databases. The results obtained show the superiority of our method when comparing with those of the state-of-the-art. The implementation code can be found at: https://github.com/o-messai/multi-score-SIQA
Robust and accurate depth estimation by fusing LiDAR and Stereo
Xu, Guangyao, Fan, Junfeng, Li, En, Long, Xiaoyu, Guo, Rui
Depth estimation is one of the key technologies in some fields such as autonomous driving and robot navigation. However, the traditional method of using a single sensor is inevitably limited by the performance of the sensor. Therefore, a precision and robust method for fusing the LiDAR and stereo cameras is proposed. This method fully combines the advantages of the LiDAR and stereo camera, which can retain the advantages of the high precision of the LiDAR and the high resolution of images respectively. Compared with the traditional stereo matching method, the texture of the object and lighting conditions have less influence on the algorithm. Firstly, the depth of the LiDAR data is converted to the disparity of the stereo camera. Because the density of the LiDAR data is relatively sparse on the y-axis, the converted disparity map is up-sampled using the interpolation method. Secondly, in order to make full use of the precise disparity map, the disparity map and stereo matching are fused to propagate the accurate disparity. Finally, the disparity map is converted to the depth map. Moreover, the converted disparity map can also increase the speed of the algorithm. We evaluate the proposed pipeline on the KITTI benchmark. The experiment demonstrates that our algorithm has higher accuracy than several classic methods.
Object Disparity
Most of stereo vision works are focusing on computing the dense pixel disparity of a given pair of left and right images. A camera pair usually required lens undistortion and stereo calibration to provide an undistorted epipolar line calibrated image pair for accurate dense pixel disparity computation. Due to noise, object occlusion, repetitive or lack of texture and limitation of matching algorithms, the pixel disparity accuracy usually suffers the most at those object boundary areas. Although statistically the total number of pixel disparity errors might be low (under 2% according to the Kitti Vision Benchmark of current top ranking algorithms), the percentage of these disparity errors at object boundaries are very high. This renders the subsequence 3D object distance detection with much lower accuracy than desired. This paper proposed a different approach for solving a 3D object distance detection by detecting object disparity directly without going through a dense pixel disparity computation. An example squeezenet Object Disparity-SSD (OD-SSD) was constructed to demonstrate an efficient object disparity detection with comparable accuracy compared with Kitti dataset pixel disparity ground truth. Further training and testing results with mixed image dataset captured by several different stereo systems may suggest that an OD-SSD might be agnostic to stereo system parameters such as a baseline, FOV, lens distortion, even left/right camera epipolar line misalignment.
A Deeper Insight into the UnDEMoN: Unsupervised Deep Network for Depth and Ego-Motion Estimation
Babu, Madhu V, Majumder, Anima, Das, Kaushik, Kumar, Swagat
This paper presents an unsupervised deep learning framework called UnDEMoN for estimating dense depth map and 6-DoF camera pose information directly from monocular images. The proposed network is trained using unlabeled monocular stereo image pairs and is shown to provide superior performance in depth and ego-motion estimation compared to the existing state-of-the-art. These improvements are achieved by introducing a new objective function that aims to minimize spatial as well as temporal reconstruction losses simultaneously. These losses are defined using bi-linear sampling kernel and penalized using the Charbonnier penalty function. The objective function, thus created, provides robustness to image gradient noises thereby improving the overall estimation accuracy without resorting to any coarse to fine strategies which are currently prevalent in the literature. Another novelty lies in the fact that we combine a disparity-based depth estimation network with a pose estimation network to obtain absolute scale-aware 6 DOF Camera pose and superior depth map. The effectiveness of the proposed approach is demonstrated through performance comparison with the existing supervised and unsupervised methods on the KITTI driving dataset.